
\documentclass{article}

\setlength{\parindent}{0pt}
\addtolength{\parskip}{\baselineskip}

\usepackage{listings}
\usepackage[english]{babel}
\usepackage{fullpage}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amsthm}
\usepackage{fancyvrb}
\usepackage[noend]{algorithmic}
\usepackage{algorithm}
\usepackage{color}
\usepackage{subfigure}

\title{Cell SVD Project \\ CMSC 691A, Spring 2008}
\author{Luke~Georgalas, Andrew~M.~Raim \\ University of Maryland, Baltimore County}
\date{5/13/2008}

\newtheorem*{defn1}{Orthogonal Vectors}
\newtheorem*{defn2}{Orthogonal Matrix}
\newtheorem*{defn3}{Diagonal Matrix}
\newtheorem*{defn4}{Singular Value Decomposition}
\newtheorem*{defn5}{Rotation Matrix}

\renewcommand{\fboxsep}{0.2in}

\begin{document}
\maketitle

\begin{abstract}
In this project, we attempt an efficient implementation of a Singular Value Decomposition (SVD) algorithm on the 
Cell Broadband Engine architecture. Our work is based on a streaming SVD algorithm implemented on the MIT RAW
architecture.
This algorithm has many desirable qualities, notably its suitability for parallelization. However, RAW differs
from the Cell, which motivates us to design a customized implementation. Our eventual goal is to
perform Latent Semantic Indexing (a technique used in information retrieval) and analyze large term-document
matrices.
\end{abstract}

\section{Introduction}
Singular Value Decomposition (SVD) is a powerful mathematical technique, used for a variety of purposes across
many fields. It is a specific factorization of a matrix, and is closely related to eigenvalue decomposition. One
such use is to determine, for multi-dimensional data, which dimensions contain the most interesting variations and
which are merely noise. It is often desirable to discard the less useful dimensions, and work with an approximation
to the original matrix.

Latent Semantic Indexing (LSI) is a technique in the field of Information Retrieval which is built on top of SVD. It
is used to extract semantic concepts from a set of documents, for a variety of analysis tasks.

LSI is very expensive to perform on real life sets of documents, because very large matrices are involved. In this 
paper, we will explore the implementation of an SVD on the IBM Cell Broadband Engine architecture (CBEA), with the 
objective being to enable LSI computation on real life sized sets of documents. 

\subsection{Singular Value Decomposition}

% Orthogonal vectors
\begin{defn1}
Two vectors $u, v \in \mathbb{R}^n$ are orthogonal if $u \cdot v = u^T v = u_1 v_1 + \cdots + u_n v_n = 0$.
\end{defn1}

Note that $u \cdot u = u_1 u_1 + \cdots + u_n u_n = || u ||^2$. If $u$ is normalized, then $u \cdot u = 1$.

% Orthogonal matrix
\begin{defn2}
A matrix $Q$ is orthogonal if all row vectors are pairwise orthogonal and normalized; in other words,
$Q Q^T = Q^T Q = I$. In this case, $Q^T = Q^{-1}$.
\end{defn2}

The product of two orthogonal matrices $Q_1, Q_2$ is also an orthogonal matrix:

\begin{displaymath}
(Q_1 Q_2) (Q_1 Q_2)^T = Q_1 Q_2 Q_2^T Q_1^T = Q_1 I Q_1^T = Q_1 Q_1^T = I.
\end{displaymath}

% Diagonal matrix
\begin{defn3}
An $m \times n$ matrix $M$ is \emph{diagonal} if $M_{ij} = 0$ for all $i \neq j$. The diagonal entries $M_{ii}$
may or may not be zero.
\end{defn3}

% SVD
\begin{defn4}
Suppose $A$ is an $m \times n$ matrix, where each entry $a_{ij} \in \mathbb{R}$. The singular value decomposition is

\begin{displaymath}
A = U S V^T  
\end{displaymath}

Where:
\begin{itemize}
\item $U$ is an $m \times m$ orthogonal matrix
\item $S$ is an $m \times n$ diagonal matrix, with the last $m - n$ rows containing all zero entries (assuming
$m \geq n$). The diagonal entries are known as the \emph{singular values}. By convention, the rows of $S$ are
ordered by singular value, so that the top row contains the largest value and so on.
\item $V^T$ is an $n \times n$ orthogonal matrix
\end{itemize}
\end{defn4}

% What do the singular values represent?

% Does this exist for every matrix A?

% How to approximate A by reducing dimensions
We can create an approximation $A'$ to the original $A$ by choosing the top $k$ singular values in $S$, and setting the 
rest to zero.
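As a small concrete illustration, the following sketch (helper names are our own, not part of any library) builds a tiny $2 \times 2$ SVD out of $45^{\circ}$ rotation matrices and truncates it to rank $1$:

```python
import math

def matmul(X, Y):
    """Naive dense matrix multiply on lists of lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def truncate(S, k):
    """Zero all but the k largest diagonal entries of S (assumed sorted)."""
    return [[S[i][j] if (i == j and i < k) else 0.0
             for j in range(len(S[0]))] for i in range(len(S))]

# A tiny SVD where U and V^T are rotations by 45 degrees (hence orthogonal).
c = s = math.cos(math.pi / 4)
U  = [[c, -s], [s, c]]
S  = [[3.0, 0.0], [0.0, 0.1]]
Vt = [[c, s], [-s, c]]

A  = matmul(matmul(U, S), Vt)               # original matrix
A1 = matmul(matmul(U, truncate(S, 1)), Vt)  # rank-1 approximation
```

Dropping the small singular value $0.1$ perturbs each entry of $A$ only slightly; the size of the discarded singular values governs the approximation error.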

Computing the SVD for large matrices can be expensive. The singular values are not readily apparent in most matrices, so most SVD algorithms
apply a series of transformations to convert the original matrix into a form where they are.

\subsection{Latent Semantic Indexing}
Latent Semantic Indexing (LSI) is a technique used in the field of Information Retrieval to semantically represent the relationships
between terms, documents, and concepts. It is based on the SVD. LSI starts from the term-document matrix: a matrix of $M$ rows (terms)
by $N$ columns (documents), with each entry containing the frequency of the term in the corresponding document. This matrix is
built from a \emph{corpus} of documents.

\begin{displaymath}
A =  
\underbrace{
\left[
\begin{array}{cccc}
  freq_{1,1} &   freq_{1,2} &  \cdots & freq_{1,n} \\
  freq_{2,1} &   freq_{2,2} &  \cdots & freq_{2,n} \\
  \vdots  &   \vdots &   \ddots & \vdots  \\
  freq_{m,1} &   freq_{m,2} &  \cdots & freq_{m,n} \\
\end{array}
\right] }_{\textrm{(documents)}}
\textrm{(terms)}
\end{displaymath}

An SVD is performed on this matrix to obtain  

\begin{displaymath}
\underbrace{
\left[
\begin{array}{cccc}
  t_{1,1} &   t_{1,2} &  \cdots & t_{1,m} \\
  t_{2,1} &   t_{2,2} &  \cdots & t_{2,m} \\
  \vdots  &   \vdots &   \ddots & \vdots  \\
  t_{m,1} &   t_{m,2} &  \cdots & t_{m,m} \\
\end{array}
\right] }_{T}
%
\underbrace{
\left[
\begin{array}{cccc}
  s_{1,1} &   0 &  \cdots & 0 \\
  0 &   s_{2,2} &  \cdots & 0 \\
  \vdots  &   \vdots &   \ddots & \vdots  \\
  0 &   0 &  \cdots & s_{n,n} \\
  0 &   0 &  \cdots & 0 \\
  \vdots  &   \vdots &   \ddots & \vdots  \\
  0 &   0 &  \cdots & 0 \\
\end{array}
\right] }_{S}
%
\underbrace{
\left[
\begin{array}{cccc}
  d^T_{1,1} &   d^T_{1,2} &  \cdots & d^T_{1,n} \\
  d^T_{2,1} &   d^T_{2,2} &  \cdots & d^T_{2,n} \\
  \vdots  &   \vdots &   \ddots & \vdots  \\
  d^T_{n,1} &   d^T_{n,2} &  \cdots & d^T_{n,n} \\
\end{array}
\right] }_{D^T}
\end{displaymath}

Where:
\begin{itemize}
\item $S$ is the matrix of singular values, which represent semantic concepts
\item $T$ maps sets of terms to these concepts
\item $D^T$ maps from these concepts to a set of documents  
\end{itemize}

Each singular value in $S$ represents the relative dominance of each dimension in \emph{concept space}. Concept space is a 
collection of concepts - words which are related in the corpus. Words are linearly combined using weights to form a 
\emph{concept}.

For large \emph{corpora} of documents, performing LSI can be prohibitively expensive because of the work to calculate the SVD. 

Once computed, some of the tasks that can be performed with LSI include:
\begin{itemize}
\item Comparing documents in a corpus to see how related they are
\item Comparing terms in a corpus to see how related they are
\item Comparing terms with documents
\item Incorporating new documents into the concept space and finding their closest matches (queries can be treated as documents)
\end{itemize}

\subsection{Cell Broadband Architecture}
The IBM Cell Broadband Engine is a hardware architecture designed for efficient parallelism. Some of its features include:
\begin{itemize}
\item A traditional ``control'' CPU called the PowerPC Processing Element (PPE), surrounded by a number (currently six or eight) of Synergistic
Processing Elements (SPEs)
\item The PPE is intended for control, and implements the full PowerPC instruction set
\item The SPEs are specialized CPUs intended for fast computation
\item Both processor types contain support for Single Instruction Multiple Data (SIMD) operations
\item SPEs are controlled from the PPE, but can operate independently once instructions are given
\item Large amount of bandwidth available between PPE, SPEs, and main memory
\item SPEs cannot access main memory directly, but operate out of a $256$ KB local store. Each SPE has a memory flow controller that it can
use to fetch outside data into its store
\item There are also \emph{mailbox} and \emph{signal} mechanisms available for inter-CPU communication without memory accesses  
\item The programming language of choice for the Cell is C, with an added set of \emph{intrinsic} operations to take advantage of low level
features in the architecture
\end{itemize}

We will attempt to take advantage of this architecture's strengths to create an efficient SVD implementation that will scale to a large number of
processors. Our test platform is the Sony Playstation 3, featuring a single Cell chip with 6 SPEs, running Yellow Dog Linux as the operating system.
It has $256$ MB of main memory and $30$ GB of storage (with some allocated for the gaming OS and the Linux distribution).

\section{Project Objectives}
Our objective is to make the SVD feasible on very large matrices used in LSI. An example is the Reuters Corpus.

The Reuters Corpus is a collection of SGML files, each of which is a collection of Reuters news articles from 1987. The articles are
listed by an identifier, which we associated with an ID/index of our own. We also took each word in the corpus and assigned it an 
ID/index. We pulled the body and title of each article out, and created a term-document matrix, where the rows of the matrix are 
term IDs in the corpus, the columns are the document IDs, and the matrix elements are how many times a particular term appeared in a
particular document. It is possible to vary the size of the term-document matrix, while maintaining consistency, by using a reduced number
of documents.
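The construction just described can be sketched in a few lines (a hypothetical simplification of our preprocessing; the real tokenization of the SGML files is more involved):

```python
def term_document_matrix(docs):
    """Build a term-document frequency matrix from whitespace-tokenized
    documents; term IDs are assigned in order of first appearance."""
    term_ids = {}
    for doc in docs:
        for word in doc.split():
            term_ids.setdefault(word, len(term_ids))
    A = [[0] * len(docs) for _ in term_ids]
    for j, doc in enumerate(docs):
        for word in doc.split():
            A[term_ids[word]][j] += 1   # frequency of term i in document j
    return A, term_ids

docs = ["cell spe spe ppe", "cell svd"]
A, term_ids = term_document_matrix(docs)
```

Rows correspond to term IDs and columns to document IDs, exactly as in the matrix $A$ above.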

Some of the features of the matrix representing the full corpus are:
\begin{itemize}
\item Number of terms $M = 54574$
\item Number of documents $N = 19043$
\item Total number of entries in $A_{m,n}$ is $54574 \times 19043 \approx 1.04 \times 10^9$
\item Total number of non-zero entries is $1215837$, which is only $0.117 \%$ of the total
\item If each entry is represented by a single byte, the full matrix would occupy $991$ MB but the non-zeroes only $1.16$ MB
\item The SVD components (as well as the working copy of $A$) would require floating point elements for storage:
  \begin{itemize}
  \item $A_{m,n}$ : $54574 \times 19043 \times 4 = 3.872$ GB
  \item $U_{m,m}$ : $54574^2 \times 4 = 11.095$ GB
  \item $S[1 : n]$ : $19043 \times 4 = 74.387$ KB, if we store only the diagonal
  \item $V^T_{n,n}$ : $19043^2 \times 4 = 1.351$ GB
  \end{itemize}
\end{itemize}

It is possible to represent the original $A$ matrix sparsely and save a significant amount of space. Unfortunately, due to the nature of our rotations,
the working copy of $A$ will quickly have zero entries transformed into non-zeroes. Using a sparse matrix will not necessarily be helpful in this case,
and may significantly impair performance due to the amount of allocation (and possibly deallocation) that may be required during the course of the algorithm. Also, the
resulting $U$ and $V$ matrices may be very dense, even for a sparse $A$.

It is apparent that storage becomes a limiting factor in this algorithm. It would be worth investigating other methods that avoid the need to expand the 
original $A$ matrix to a dense representation or explicitly calculate $U$ or $V$.

Our goal for the moment is to achieve good performance for the general purpose SVD calculation on the Cell. We hope to show that our algorithm scales well
to multiple SPEs on smaller problems, suggesting that larger problems like the Reuters Corpus could be handled on larger scale Cell platforms.

\section{Literature review}
Our project is based primarily on the work of Strumpen \cite{strumpen}. The authors of this paper describe several classical
SVD algorithms that do not parallelize well, including Golub-Kahan-Reinsch and the classical Jacobi algorithm. They then describe
the Hestenes-Jacobi algorithm, which is more amenable to parallelization, and base their implementation on it.

\subsection{Orthogonalizing the Matrix}
The idea behind Hestenes-Jacobi is to transform the row vectors of the original matrix $A$ so that (1) all rows are pairwise orthogonal
to each other, and (2) the norms of the original vectors are preserved.

\begin{defn5}
A matrix operation $Q$ is a \emph{rotation} if it preserves vector length so that $|| Q v || = || v || $. 
\end{defn5}

It turns out that the class of orthogonal matrices is exactly the same as the class of rotation matrices \cite{wiki:rotation}. 

Rotations are iteratively applied to pairs of rows from the original matrix, until sufficient orthogonality has been reached at step $t$. 
Suppose rotation $\Phi^{(i)}$ is applied on the $i$th iteration of the algorithm, to obtain result $A^{(i)}$. We can accumulate these rotations
into a single rotation

\begin{displaymath}
U^T \gets \Phi^{(t)} \Phi^{(t-1)} \cdots \Phi^{(2)} \Phi^{(1)} I 
\end{displaymath}

Since $U^T$ is a product of orthogonal transformations, it is itself orthogonal. Then $U^T = U^{-1}$, and we can use it in the definition of the SVD
to find the singular value matrix $S$, and the other orthogonal matrix $V^T$

\begin{eqnarray*}
A^{(t)} &=& \Phi^{(t)} \Phi^{(t-1)} \cdots \Phi^{(2)} \Phi^{(1)} A \\
&=& U^T A \\
&=& S V^T \\
&=& 
\left[
\begin{array}{cccc}
  s_1     & 0        & \cdots & 0 \\
  0       & s_2      & \cdots & 0 \\
  \vdots  &   \vdots & \ddots & \vdots  \\
  0       & 0        & \cdots & s_n \\
  0       & 0        & \cdots & 0 \\
  \vdots  &   \vdots & \ddots & \vdots  \\
  0       & 0        & \cdots & 0
\end{array}
\right]
\left[
\begin{array}{cccc}
  v^T_{1,1} &   v^T_{1,2} &  \cdots & v^T_{1,n} \\
  v^T_{2,1} &   v^T_{2,2} &  \cdots & v^T_{2,n} \\
  \vdots    &   \vdots    &  \ddots & \vdots  \\
  v^T_{n,1} &   v^T_{n,2} &  \cdots & v^T_{n,n} \\
\end{array}
\right] \\
&=& 
\left[
\begin{array}{cccc}
  s_1 v^T_{1,1} &   s_1 v^T_{1,2} &  \cdots & s_1 v^T_{1,n} \\
  s_2 v^T_{2,1} &   s_2 v^T_{2,2} &  \cdots & s_2 v^T_{2,n} \\
  \vdots        &   \vdots        &  \ddots & \vdots  \\
  s_n v^T_{n,1} &   s_n v^T_{n,2} &  \cdots & s_n v^T_{n,n} \\
\end{array}
\right]. \\
\end{eqnarray*}

Therefore
\begin{equation}
S_{i,i} = || A^{(t)}[i] || 
\quad \textrm{and} \quad
v^T[i] = \frac{ A^{(t)}[i] }{ || A^{(t)}[i] || }
\qquad \textrm{for } i = 1, \cdots, n
\end{equation}
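Equation (1) translates directly into code; a minimal sketch (row-major lists of floats, helper name ours, assuming no zero rows):

```python
import math

def extract_s_and_vt(At):
    """Recover the singular values and the rows of V^T from the
    row-orthogonalized matrix A^(t), per equation (1):
    s_i = ||A^(t)[i]|| and v^T[i] = A^(t)[i] / s_i."""
    S, Vt = [], []
    for row in At:
        s = math.sqrt(sum(x * x for x in row))  # row norm = singular value
        S.append(s)
        Vt.append([x / s for x in row])         # normalized row of V^T
    return S, Vt

# The rows here are already orthogonal, so the recovery is exact.
S, Vt = extract_s_and_vt([[3.0, 0.0], [0.0, 4.0]])
```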

Note that instead of accumulating $U^T$ explicitly, $U$ can also be recovered from the original $A$ and the result $A^{(t)}$.
Since $A^{(t)} = U^T A$ implies $A = U A^{(t)}$, and $A^{(t)} A^{(t) T} = S V^T V S^T = S S^T$, we have
\begin{eqnarray*}
A \, A^{(t) T} &=& U A^{(t)} A^{(t) T} \\
&=& U S S^T.
\end{eqnarray*}
$S S^T$ is diagonal with entries $s_i^2$, so the $i$th column of $U$, for $i = 1, \cdots, n$, is the $i$th column of
$A \, A^{(t) T}$ divided by $s_i^2$ (the remaining columns correspond to the zero rows of $S$).

\subsection{Jacobi Transform}
The specific rotation that the authors use is called the Jacobi transform. It is defined as the $m \times m$ identity matrix
$I_m$, modified so that

\begin{eqnarray*}
J_{i,i} &=& \cos\theta \\
J_{i,j} &=& -\sin\theta \\
J_{j,i} &=& \sin\theta \\
J_{j,j} &=& \cos\theta \\
\end{eqnarray*}

for given row indices $i$ and $j$ ($i \neq j$).  The structure of $J$ is as follows

\begin{displaymath}
J = 
\left[
\begin{array}{ccccccc}
  1 &         &                &         &                &         &   \\
    & \ddots  &                &         &                &         &   \\
    &         &   \cos\theta  &         &  -\sin\theta  &         &   \\
    &         &                & \ddots  &                &         &   \\
    &         &   \sin\theta  &         &   \cos\theta  &         &   \\
    &         &                &         &                &  \ddots &   \\
    &         &                &         &                &         & 1 \\
\end{array}
\right].
\end{displaymath}

It can be shown that this transformation is orthogonal, and is therefore a rotation. It can also be shown that it only affects elements $i$ and $j$
of any vector it is applied to; if applied to a data matrix, it will only affect rows $i$ and $j$. This is crucial for parallelizing the SVD, as we will
see later.

\begin{displaymath}
J A =
\left[
\begin{array}{ccccccc}
  1 &         &                &         &                &         &   \\
    & \ddots  &                &         &                &         &   \\
    &         &   \cos\theta  &         &  -\sin\theta  &         &   \\
    &         &                & \ddots  &                &         &   \\
    &         &   \sin\theta  &         &   \cos\theta  &         &   \\
    &         &                &         &                &  \ddots &   \\
    &         &                &         &                &         & 1 \\
\end{array}
\right]
\left[
\begin{array}{c}
A[1] \\
A[2] \\
\vdots \\
A[m] \\
\end{array}
\right]
\end{displaymath}

\begin{eqnarray}
\left[
\begin{array}{c}
A[i]_{new} \\
A[j]_{new} \\
\end{array}
\right]
&=&
\left[
\begin{array}{cc}
\cos\theta  &  -\sin\theta  \\
\sin\theta  &   \cos\theta  \\
\end{array}
\right]
\left[
\begin{array}{c}
A[i] \\
A[j] \\
\end{array}
\right]  \nonumber \\
A[i]_{new} &=& A[i] \cdot \cos\theta - A[j] \cdot \sin\theta \\
A[j]_{new} &=& A[i] \cdot \sin\theta + A[j] \cdot \cos\theta
\end{eqnarray}

\begin{equation}
A[k]_{new} = A[k] \qquad \textrm{ for } k \neq i \textrm{ and } k \neq j
\end{equation} 

The rotation angle $\theta$ is chosen based on the angle $\phi$ between the row vectors $A[i]$ and $A[j]$, so that $\phi + \theta =
\frac{\pi}{2}$. Figure \ref{fig:rot_example} illustrates this process.
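The update equations (2) and (3) are cheap to apply; the following sketch (plain Python for clarity) applies a rotation to a pair of rows:

```python
import math

def rotate_pair(Ai, Aj, theta):
    """Apply the Jacobi rotation with angle theta to rows i and j,
    per equations (2) and (3); all other rows are untouched."""
    c, s = math.cos(theta), math.sin(theta)
    Ai_new = [c * a - s * b for a, b in zip(Ai, Aj)]
    Aj_new = [s * a + c * b for a, b in zip(Ai, Aj)]
    return Ai_new, Aj_new
```

Because the rotation is orthogonal, $||A[i]||^2 + ||A[j]||^2$ is preserved for any $\theta$, and applying the rotation with $-\theta$ undoes it.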

\begin{figure}
\centering
\subfigure[Acute correction]{
\fbox{\includegraphics[scale=0.5]{graphics/AcuteRotation.png}}
\label{fig:rot_example_acute}
}
\subfigure[Obtuse correction]{
\fbox{\includegraphics[scale=0.5]{graphics/ObtuseRotation.png}}
\label{fig:rot_example_obtuse}
}
\caption{Illustration of rotations to correct acute angle \subref{fig:rot_example_acute} and obtuse angle \subref{fig:rot_example_obtuse} 
while orthogonalizing a pair of row vectors}
\label{fig:rot_example}
\end{figure}



\subsection{Stream SVD}
Stream SVD is an implementation of the Hestenes-Jacobi algorithm, designed to run on an $R \times R$ grid of processors. Matrix rows are streamed
into the grid in a systolic manner, so that up to $2R$ rows are rotated in one iteration. This optimization requires two passes through the rows: 
one pass to compute the rotation angle between row pairs, and one pass to apply the computed rotation to row pairs.

The rows of the original $A$ matrix are partitioned into blocks of size $R$ (with the last block potentially having a remainder). Each pair of blocks
from this partitioning becomes a job that must pass through the processing grid.

Rotations are not computed and applied serially, but in a batch. There is a possibility that they will counteract each other to some degree, and the
resulting rows will not be completely orthogonal. Therefore the entire algorithm must be executed repeatedly until convergence is achieved. The authors
give bounds on the amount of ``missed opportunities for propagating information'', and empirical results that show encouraging convergence behavior
when $R \leq \sqrt{N}$ for $N \times N$ matrices.

The authors give the order of multiply/add operations in terms of the matrix dimensions $M$ and $N$ as

\begin{displaymath}
C(M, N) = \Theta \big( (5N + 4) \cdot \frac{1}{2} M(M - 1) \big)
\end{displaymath}


\section{Cell SVD Algorithm}
We attempt a customized implementation of the Stream SVD algorithm on the Cell. Although the Cell is designed to support
fast interprocessor (inter-SPE) communication, it is better to allow the SPEs to stream through data independently of each other,
since synchronizing communications at each calculation step would lead to a significant amount of overhead. We
also want to avoid the need for a square array of processors: the Playstation 3 comes with 6 SPEs, commercial Cell
blades come with 8, and we would like to use all available processors to aid in our computation.

We use many of the same concepts as the Stream SVD algorithm, with one very significant difference: the way we parallelize
the work. We will now describe this in detail.

\subsection{The $Work$ Matrix}

Recall our original data matrix
\begin{displaymath}
A_{m, n} = 
\left[
\begin{array}{cccc}
  a_{1,1} &   a_{1,2} &  \cdots & a_{1,n} \\
  a_{2,1} &   a_{2,2} &  \cdots & a_{2,n} \\
  \vdots  &   \vdots  &  \ddots & \vdots  \\
  a_{m,1} &   a_{m,2} &  \cdots & a_{m,n} \\
\end{array}
\right]
\end{displaymath}

As in Stream SVD, we group the rows of $A$ into row blocks of size $R$. Call the resulting blocks $RB = \{rb_1, \cdots, rb_k\}$, where
$k = \lceil \frac{M}{R} \rceil$:

% Use brackets on the original A matrix if possible, so this doesn't look like crap
\begin{displaymath}
\begin{array}{|c|}
  \hline
  rb_1 \textrm{: rows } 1 \textrm{ to } R \\ \hline
  rb_2 \textrm{: rows } R+1 \textrm{ to } 2R \\  \hline
  \vdots  \\ \hline
  rb_k \textrm{: rows } (k-1)R+1 \textrm{ to } M \\ \hline
\end{array}
\end{displaymath}

The $Work$ matrix captures all pairs of row blocks in $RB \times RB$

\begin{displaymath}
Work_{k,k} = 
\begin{array}{|c|c|c|c|}
  \hline
  (rb_1, rb_1) &   (rb_1, rb_2) &  \cdots & (rb_1, rb_k) \\ \hline
  (rb_2, rb_1) &   (rb_2, rb_2) &  \cdots & (rb_2, rb_k) \\ \hline
  \vdots       &   \vdots       &  \ddots & \vdots  \\ \hline
  (rb_k, rb_1) &   (rb_k, rb_2) &  \cdots & (rb_k, rb_k) \\ \hline
\end{array}
\end{displaymath}

Only the upper-triangular entries of this matrix represent jobs which need to be computed, since $Work_{i,j}$ and $Work_{j,i}$
contain the same pair of blocks. For a given $Work$ matrix entry $Work_{i,j} = (rb_i, rb_j)$, we can find the row indices into
the original data matrix $A$ as follows:

\begin{eqnarray*}
lb_i &=& (i - 1) \, R + 1 \\
ub_i &=& \min(lb_i + R - 1, \; M) \\
lb_j &=& (j - 1) \, R + 1 \\
ub_j &=& \min(lb_j + R - 1, \; M) \\
A[lb_i : ub_i] & = & \textrm{ rows for } rb_i \\
A[lb_j : ub_j] & = & \textrm{ rows for } rb_j \\
\end{eqnarray*}
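A 0-based variant of this index computation, as it might appear in code (our own sketch; block indices remain 1-based as in the text):

```python
def block_bounds(i, R, M):
    """Half-open row range [lb, ub) of block rb_i in 0-based array
    indexing, for 1-based block index i."""
    lb = (i - 1) * R
    ub = min(lb + R, M)   # the last block may hold fewer than R rows
    return lb, ub

def jobs(k):
    """The upper-triangular Work entries: each pair of blocks is a job."""
    return [(i, j) for i in range(1, k + 1) for j in range(i, k + 1)]
```

For example, with $R = 10$ and $M = 25$ there are $k = 3$ blocks, and the last block holds only five rows.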

\subsection{Processing a Chunk of Work}
We will now describe the processing steps that must occur for each $Work_{i,j}$ entry in the $Work$ matrix.

\begin{algorithm}
\caption{High level description of processing a work chunk $[lb_i, ub_i] \times [lb_j, ub_j]$}
\label{alg:workchunk}
\input{workchunk.tex}
\end{algorithm}

\begin{algorithm}
\caption{Compute a Jacobi rotation for two vectors $v_1, v_2$, with $a = ||v_1||, b = ||v_2||, g = v_1 \cdot v_2 $}
\label{alg:jacobi}
\input{jacobi.tex}
\end{algorithm}

The $sign$ function is defined as


\begin{equation}
sign(x) = \left\{
\begin{array}{rl}
-1 & \text{if } x < 0 \\
1 & \text{otherwise}
\end{array} 
\right.
\end{equation}

We can make several observations while studying Algorithm \ref{alg:workchunk}. First, all rows in $(rb_i, rb_j)$ will potentially be
updated during this operation, so if we are to run multiple chunks in parallel, we must ensure that no row is processed
in more than one chunk at a time.

We also observe that our computations are dominated by several phases:
\begin{itemize}
\item computing norms for each vector
\item computing dot products for pairs of vectors
\item computing a Jacobi rotation for each pair, using the norms and dot products
\item and finally applying the computed rotation to each pair
\end{itemize}

Using the straightforward method in Algorithm \ref{alg:workchunk}, we will make $O(R)$ passes through each row vector. However, if we refer back 
to the systolic method used in Strumpen \cite{strumpen}, we can reduce the number of passes to two. We can do this by processing all rows in 
$(rb_i, rb_j)$ at once, element-by-element. While doing so, we can accumulate all norms and dot products. Then we can
compute all rotations at once. With one more element-by-element pass through all the rows, we can apply all rotations.

This assumes that it is reasonable to change the order of operations: instead of calculating and applying rotations sequentially
through each pair of rows, we are now calculating all rotations independently, then applying them at once. Fortunately, Strumpen's
results showed that this does not interfere with convergence too drastically (not too many extra rotations are needed)
when $R \leq \sqrt{N}$, for $N \times N$ matrices.
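A simplified, in-memory (non-streaming) sketch of this two-pass scheme; the angle computation is left as a parameter, since in our implementation it follows Algorithm \ref{alg:jacobi}:

```python
import math

def two_pass(rows, pairs, angle):
    """Pass 1: accumulate norms and dot products, and compute every
    rotation angle from the *unrotated* data.  Pass 2: apply all of
    the rotations.  angle(a, b, g) computes the Jacobi angle from the
    two row norms and the dot product."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in rows]
    thetas = {}
    for i, j in pairs:
        g = sum(a * b for a, b in zip(rows[i], rows[j]))
        thetas[(i, j)] = angle(norms[i], norms[j], g)
    for (i, j), t in thetas.items():
        c, s = math.cos(t), math.sin(t)
        rows[i], rows[j] = (
            [c * a - s * b for a, b in zip(rows[i], rows[j])],
            [s * a + c * b for a, b in zip(rows[i], rows[j])],
        )
    return rows
```

The batching is what introduces the possibility of rotations counteracting each other, since every angle is computed before any rotation is applied.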

\begin{algorithm}
\caption{Systolic-style rotation calculation}
\label{alg:workchunk_calculation}
\input{workchunk_calculation.tex}
\end{algorithm}

\begin{algorithm}
\caption{Systolic-style rotation application}
\label{alg:workchunk_application}
\input{workchunk_application.tex}
\end{algorithm}

\subsection{Work Control}
There may be dependencies between the individual chunks of work, so we must take care when parallelizing their execution. We
would like to run as many jobs $(Work_{i_1,j_1}, Work_{i_2,j_2}, \cdots,  Work_{i_n,j_n})$ as possible at the same time, up to the
number of processing units available to us. The constraints that must be met are that $i_1 \neq i_2 \neq \cdots \neq i_n$ and
$j_1 \neq j_2 \neq \cdots \neq j_n$. A reasonable way to meet these constraints is to process one diagonal of the $Work$ matrix at a time.
This is illustrated in Figure \ref{fig:diag_work_control}.
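The diagonal schedule of Figure \ref{fig:diag_work_control} can be enumerated as follows (our own sketch; 1-based block indices):

```python
def diagonal_schedule(k):
    """Group the upper-triangular Work entries by diagonal: diagonal d
    holds the pairs (i, i + d).  Within a diagonal, no index repeats in
    the first position or in the second position, satisfying the
    constraints above, so its jobs may run concurrently."""
    return [[(i, i + d) for i in range(1, k - d + 1)] for d in range(k)]

sched = diagonal_schedule(4)   # the four diagonals t_1 ... t_4 in the figure
```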

\begin{figure}
\begin{displaymath}
\rightarrow
\begin{array}{|c|c|c|c|}
  \hline
  \color{red} t_1 &  &  & \\ \hline
    &   \color{red} t_1 &  & \\ \hline
    &  & \color{red} t_1 & \\ \hline
    &  &  & \color{red} t_1 \\ \hline
\end{array}
\rightarrow
\begin{array}{|c|c|c|c|}
  \hline
  t_1 & \color{red} t_2 &  & \\ \hline
    & t_1 &   \color{red} t_2 & \\ \hline
    &  & t_1 & \color{red} t_2 \\ \hline
    &  &  & t_1 \\ \hline
\end{array}
\end{displaymath}

% Last two frames of animation - shows up as lower row
\begin{displaymath}
\rightarrow
\begin{array}{|c|c|c|c|}
  \hline
  t_1 & t_2 & \color{red} t_3 & \\ \hline
    & t_1 & t_2 & \color{red} t_3 \\ \hline
    & & t_1 & t_2 \\ \hline
    & & & t_1 \\ \hline
\end{array}
\rightarrow
\begin{array}{|c|c|c|c|}
  \hline
  t_1 & t_2 & t_3 & \color{red} t_4 \\ \hline
    & t_1 & t_2 & t_3\\ \hline
    & & t_1 & t_2 \\ \hline
    & & & t_1 \\ \hline
\end{array}
\end{displaymath}
\caption{In order to maintain result consistency during parallelization, process one diagonal of the $Work$ matrix at a time}
\label{fig:diag_work_control}
\end{figure}

The algorithm for work control using diagonals of the $Work$ matrix is given in Algorithm \ref{alg:workcontrol}.

\begin{algorithm}
\caption{Work Control algorithm}
\label{alg:workcontrol}
\input{workcontrol.tex}
\end{algorithm}

\subsection{Implementation}
We will now describe the implementation of CellSVD, based on the previously discussed algorithms.

The PPE is responsible for coordinating work among the SPEs. It ensures that:

\begin{itemize}
\item The input matrix is loaded into main memory, and accessible by SPEs
\item At most, one diagonal of the $Work$ matrix is being operated on at any given time
\item Each entry of the active diagonal is processed by the next available SPE
\item Information about convergence is collected from each SPE after it finishes a work chunk
\item Processing occurs until convergence is achieved
\end{itemize}

Allocating new Pthreads for each SPE job would be very time consuming, so we avoid this. The
PPE starts all available SPE threads at the beginning of the program, communicates inputs and outputs 
with them using a simple mailbox protocol, and terminates them at the end of the program. The PPE's protocol
is shown in Figure \ref{fig:ppe_comm}, and the SPE's protocol is shown in Figure \ref{fig:spe_comm}.

\begin{figure}
\centering
\fbox{\includegraphics[width=5in]{graphics/MailProtocol-PPE.png}}
\caption{PPE work control using interprocessor mailbox communication}
\label{fig:ppe_comm}
\end{figure}

\begin{figure}
\centering
\fbox{\includegraphics[width=2.5in]{graphics/MailProtocol-SPE.png}}
\caption{SPE work control using interprocessor mailbox communication}
\label{fig:spe_comm}
\end{figure}

The $A$ and $A^{(i)}$ matrices are represented in memory as contiguous row-major blocks of $4$ byte floating point numbers. Each SPE is
responsible for reading the appropriate rows into its local store as they are needed. Rows may be too large to fit into the $256$ KB of
space (which is also shared with code), so they are brought over in small chunks of $32$ elements, or $128$ bytes. This particular size was chosen
to make efficient use of DMAs, since they bring over $128$ byte fragments of memory per operation.

We employ the technique of double buffering to bring data into SPE local store while calculations are occurring. This is used when:
\begin{itemize}
\item reading a $128$ byte row fragment during the calculation phase
\item reading a $128$ byte row fragment during the application phase
\item writing a $128$ byte row fragment during the application phase
\end{itemize}

Double buffering is possible on the Cell because the SPEs do not block when a DMA has been issued, unless the programmer explicitly chooses to wait.
Note that after we apply a rotation to a pair of $128$ byte row fragments and issue a DMA to write the results back into memory, there is no need
to wait for the DMA to complete before moving on to the next pair of fragments.
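The pattern can be sketched abstractly (hypothetical helpers; on the SPE, \emph{fetch} would be a non-blocking DMA get, and the use of a buffer would be preceded by an explicit tag-group wait):

```python
def stream_fragments(fragments, fetch, compute):
    """Double buffering: issue the fetch for fragment f+1 before
    processing fragment f, alternating between two buffers so that
    the transfer can overlap the computation."""
    buf = [None, None]
    buf[0] = fetch(fragments[0])               # fill the first buffer
    for f in range(len(fragments)):
        if f + 1 < len(fragments):
            buf[(f + 1) % 2] = fetch(fragments[f + 1])   # prefetch next
        compute(buf[f % 2])   # overlaps the in-flight transfer on real HW
```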

We take advantage of the SIMD instructions available on the SPE, and vectorize the norm and dot product operations. Rows are processed
as $4$-element \emph{vector float}s during the calculation phase. The \emph{spu\_madd} function is then used to perform the following calculation
in one instruction:

\begin{displaymath}
 \vec{d} = spu\_madd( \vec{a}, \vec{b}, \vec{c}) = (a_1 b_1 + c_1, a_2 b_2 + c_2, a_3 b_3  + c_3, a_4 b_4 + c_4). 
\end{displaymath}

The norm calculation can be achieved by setting $\vec{a}$ and $\vec{b}$ to the same row vector, and the dot product can be achieved by setting them
to a pair of row vectors. 
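A scalar emulation of this vectorized accumulation (the real code uses the \emph{vector float} type and \emph{spu\_madd}; here plain Python lists stand in for the 4-element vectors, and the row length is assumed to be a multiple of $4$):

```python
def madd(a, b, c):
    """Element-wise multiply-add, emulating spu_madd: d_i = a_i*b_i + c_i."""
    return [x * y + z for x, y, z in zip(a, b, c)]

def dot(x, y):
    """Dot product accumulated 4 lanes at a time, then reduced;
    the squared norm of x is dot(x, x)."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for p in range(0, len(x), 4):
        acc = madd(x[p:p + 4], y[p:p + 4], acc)
    return sum(acc)
```

The final lane reduction happens once per row, so almost all of the work stays in the multiply-add loop.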

%TBD: How much space will we use, and how do we set $R$?

\section{Experiments}
The code was developed in several iterations, with incremental improvements in each version:
\begin{itemize}
\item CellSVD Iterations 1 and 2 operate on the PPE only. Unfortunately they are incorrect because they attempt to operate on a sparse (but not expandable) matrix
\item CellSVD Iteration 3 runs on the PPE and SPEs, but is lacking some optimizations
\item CellSVD Iteration 4 is our most highly optimized version, using vector operations and double-buffering. 
\end{itemize}

We did not implement the systolic-style iterations mentioned earlier, but believe the use of double-buffering will hide the effect of the extra memory accesses, which occur each time a $32$ element fragment of a row is used. We are also not calculating the $U$ and $V^T$ matrices, but as mentioned earlier these can be derived later.

We will compare Iterations 3 and 4 with GNU Octave's \cite{eaton2002} SVD calculation.

\subsection{Experiment I}
In this experiment we process a $500 \times 150$ matrix containing randomly generated elements, and compare our results to those from Octave.

\begin{Verbatim}[fontsize=\footnotesize]
iteration 3, R = 10, NumThreads = 6
PPE: Done matrix[500][150]: time=3.228537 iterations=22
PPE: Done matrix[500][150]: time=3.243955 iterations=22
PPE: Done matrix[500][150]: time=4.146658 iterations=29
PPE: Done matrix[500][150]: time=4.555774 iterations=32

iteration 4, R = 10, NumThreads = 6
PPE: Done matrix[500][150]: time=13.582559 iterations=115
PPE: Done matrix[500][150]: time=9.900645 iterations=83
PPE: Done matrix[500][150]: time=10.394578 iterations=87
PPE: Done matrix[500][150]: time=15.640144 iterations=133

The resulting singular values appear to be off by a significant amount, although "in the ballpark"
For example, the largest SV is off by |732.410095 - 706.496887| = 25.91321
When setting R = 1 in iteration 4, that SV is much closer - 732.406799
\end{Verbatim}


\subsection{Experiment II}
In this experiment we vary the matrix size, $R$, and the number of threads, and examine the scaling qualities of Iteration 4. Again, the matrices are
composed of random elements.

\begin{Verbatim}[fontsize=\footnotesize]
Timing on Cell? R = ? threads = ?
Timing in Octave?
\end{Verbatim}

\subsection{Experiment III}
In this experiment we process a small subset of the Reuters Corpus.  As we discussed earlier, this data set is too 
large to be processed in its entirety. We choose $2668$ terms and $91$ documents to yield a $2668 \times 91$ matrix. 

\begin{Verbatim}[fontsize=\footnotesize]
Timing on Cell? R = ? threads = ?
Timing in Octave?
\end{Verbatim}

\subsection{Experiment IV}
In our final experiment, we examine the effect of sparseness on run time.

\begin{Verbatim}[fontsize=\footnotesize]
M = 512, N = 512

Sparsenesses:
1 
0.75 
0.5 
0.25 
0.01 

Timing on Cell? R = ? threads = ?
Timing in Octave?
\end{Verbatim}

\section{Conclusions}
We have implemented an SVD algorithm on the IBM Cell Broadband Engine, and attempted to take advantage of that architecture's strengths. These strengths
include multiple cores, fast memory access, fast interprocessor communication, and SIMD instructions.

TBD: How do the results look?

As mentioned earlier, one opportunity that we have not taken advantage of is the sparseness of term-document matrices. Working with dense 
representations of these matrices is prohibitively expensive. A suitable algorithm would be capable of operating on a sparse matrix representation, 
with a bounded number of non-zero elements at any point during execution. Ideally, no zero elements would need to be made non-zero, so that dynamic 
memory allocation could be avoided. Algorithms which take the row orthogonalization approach (such as Hestenes-Jacobi) are not suitable for this 
goal. However, algorithms that take the diagonalization approach (such as the classical Jacobi method, which does not parallelize nicely) may be worth 
investigating. Other methods besides the SVD have been suggested for computing reduced-rank approximations of sparse matrices, for example see \cite{Berry}.

There are several modifications to the algorithm that would be worth investigating. We have focused on calculating the full SVD, but in a real application
it may be desirable to update it incrementally when only a few rows have changed. Also, if the number of desired dimensions $k$ is known ahead of time, it
would be desirable to calculate only the first $k$ singular values, and not incur the full SVD cost.


% Make sure to implicitly cite all of our references here, if we didn't cite them explicitly earlier
\nocite{*}

\bibliographystyle{plain}
\bibliography{project-refs}

\end{document}


